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Abstract 

The ideal genetic analysis of family data would include whole genome sequence on all family members. A strategy 
of combining sequence data from a subset of key individuals with inexpensive, genome-wide association study 
(GWAS) chip genotypes on all individuals to infer sequence level genotypes throughout the families has been 
suggested as a highly accurate alternative. This strategy was followed by the Genetic Analysis Workshop 18 data 
providers. We examined the quality of the imputation to identify potential consequences of this strategy by 
comparing discrepancies between GWAS genotype calls and imputed calls for the same variants. Overall, the 
inference and imputation process worked very well. However, we find that discrepancies occurred at an increased 
rate when imputation was used to infer missing data in sequenced individuals. Although this may be an artifact of 
this particular instantiation of these analytic methods, there may be general genetic or algorithmic reasons to 
avoid trying to fill in missing sequence data. This is especially true given the risk of false positives and reduction in 
power for family-based transmission tests when founders are incorrectly imputed as heterozygotes. Finally, we note 
a higher rate of discrepancies when unsequenced individuals are inferred using sequenced individuals from other 
pedigrees drawn from the same admixed population. 



Background 

The ideal genetic analysis of family data would include 
whole genome sequence data on all family members. To 
save cost, a procedure has been suggested to avoid having 
to sequence every individual [1]. In particular, this proce- 
dure uses dense sequence data on a subset of individuals 
and sparse, inexpensive, genome-wide association study 
(GWAS) chip genotypes on all individuals to infer 
sequence-level genotypes on the related, unsequenced 
individuals. The Genetic Analysis Workshop 18 (GAW18) 
data providers have followed these procedures as docu- 
mented in [2] . We examine the quality of the imputation 
to identify potential consequences for this approach. 

Methods 

The data 

The GAW18 data providers obtained family data from 
two studies: the San Antonio Family Heart Study and 
the San Antonio Family Diabetes/Gallbladder Study [3]. 
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The GAW18 data set contains GWAS chip genotypes 
on 959 individuals from 20 pedigrees. Of this sample, a 
subset of 464 individuals also has whole genome 
sequence data. Although four families have no 
sequenced individuals, the remaining pedigrees are 
sequenced for roughly half of their members (Table 1). 

Generation of the data by the GAW18 providers 

We will distinguish between two ways that missing data 
were "filled in" in the GAW18 data: filling in missing 
sequence data in the sequenced individuals will be 
referred to as "imputation," and inferring sequence-level 
data for individuals who were only genotyped using a 
GWAS chip will be referred to as "inference." We 
understand the imputation and inference process fol- 
lowed by the GAW18 data providers to consist of the 
following steps: (a) the GWAS chip data were phased 
(and any untyped GWAS chip alleles imputed) using 
MaCH [4], and a haplotype scaffolding for the families 
was created; (b) missing sequence data in the sequenced 
individuals were imputed using MaCH; (c) sequence 
haplotypes for the unsequenced individuals were 
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Table 1 Discrepancies by family 
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Discrepancies in the full comparison single-nucleotide polyntiorpliisms set between GWAS data and GENO data sets, by family, individuals sequenced and 
individuals imputed. Bold indicates highest discrepancy rate by subsample. 
N=number of individuals in the family 

D^number of discrepancies observed within the family 
D//V=average number of discrepancies observed per family 



inferred from the sequenced individuals using haplotype 
scaffolding derived from the GWAS chip data; 
(d) SimWalk2 was used to identifjf Mendelian errors and 
blank inconsistent genotypes; and (e) Merlin [5] was used 
to reimpute missing genotypes. The GAW18 data provi- 
ders supplied a dosage file for each sequence variant. They 
did not provide other quality scores for the imputation 
and inference. 

Selection of single-nucleotide polymorphisms and 
individuals for discrepancy evaluation 

The sequence data were provided in "VCF" (Variant Call 
Format) files. Starting with the 8,348,685 single-nucleo- 
tide polymorphisms (SNPs) that made it through quality 
control to end up in the final genotyping files [2], we 
used the VCF files to find uniquely occurring rs num- 
bers from dbSNP that mapped to an existing SNP in the 
GWAS sample. The SNPs were required to map to the 
same chromosome and have alleles that could be 
"flipped" to align the strands; that is, a SNP of the type 
A/C in GWAS and T/G in sequencing could be aligned, 
but a SNP of the type A/C could not be aligned with an 
A/G SNP. Furthermore, because the A/T and C/G poly- 
morphisms could have ambiguous alignment, these were 
discarded before comparisons were performed. Overall, 



this resulted in 451,279 SNPs for comparison. We will 
call these SNPs the "full comparison SNPs." 

We compared two data sets containing these 451,279 
SNPs: first, the "GENO" data set (the final, cleaned, 
sequence) and second, the "GWAS" data set (as pro- 
vided, assembled from three different GWAS chips [2]). 
Because of varying call rates, we did not count missing 
genotypes as discrepancies between the two files (i.e., a 
discrepancy was noted only when a genotype was pre- 
sent [i.e., called] in both data sets but was not identical). 

We identified one strong outlier: sequenced individual 
T2DG0400247 contained 11,576 discrepancies across 
the 451,279 SNPs. The next most discrepant individual 
had "only" 1880 discrepancies. The sample for individual 
T2DG0400247 may have somehow changed between the 
GWAS and the sequencing, perhaps because of a sample 
swap or contamination. This individual was removed 
from all of our subsequent analyses, leaving 958 indivi- 
duals with GWAS data and 463 individuals with whole 
genome sequencing. 

Genotyping for the GWAS was performed on several 
different Illumina platforms, resulting in a range of 
missing rates (because of different SNPs on the plat- 
forms). The SNPs called from sequencing also had a 
high variability in missing rate. Therefore, to have a 



Hinrichs et al. BMC Proceedings 2014, 8(Suppl 1):S17 
httpy/www.biomedcentral.coni/1753-6561/8/S1/S17 



Page 3 of 5 



frame of reference that avoids focusing on the GWAS 
genotyping process or sequencing process per se, we 
identified a subset of SNPs with a 98% or better call 
rate for both GWAS and sequencing. There were 
235,549 SNPs in the "high call rate" set. 

Evaluation of discrepancies 

We then examined discrepancies in four ways to help 
identify causes of discrepancy: by individual, by presence 
or absence of sequence data, by family, and by allele 
dosage estimate. We also divided the discrepancies by ana- 
lytical process: imputation (filling in missing genotypes in 
sequenced individuals) and inference (inferring sequence 
level data for individuals without sequencing). In the 
sequenced individuals, imputed genotypes were deter- 
mined by comparing the VCF (sequencing reads) file with 
the final genotype calls file, GENO. Genotypes that were 
missing in the VCF file for a sequenced individual but 
present in the GENO file were, by definition, imputed. 

Results 

We first present the results for the high call rate SNPs 
alone and then compare these with the results found in 
the full comparison SNPs set. In both cases, we look at 
the rate of discrepancies between the GWAS file (based 
on a genotype chip) and the GENO file (genotype calls 
based on sequence, imputation, or inference). 

Discrepancy rate for the high call rate single-nucleotide 

polymorphisms 

Overall discrepancy rate 

The overall discrepancy rate combined across 
sequenced, imputed and inferred genotype calls for this 
SNP set was low (Table 2). For this broadest group of 
subjects (958 individuals) in the high call rate SNPs 
(235,549 SNPs), calls were present in both the GWAS 
and the GENO data sets 99.46% of the time. Of these, 
197,984 were discrepant (0.09%) for an overall concor- 
dance rate of 99.91%. There were no discrepancies at all 
for 68.6% {N = 161,573) of these SNPs. 



Discrepancies in the sequenced individuals 

However, looking at discrepancy rate by call process 
yields a very different picture. As indicated in Table 2, 
when we compare the sequence calls with the GWAS 
chip calls in the 463 sequenced individual, we see a low 
discrepancy rate (0.03%). This accounted for the vast 
majority of the genotype calls for these individuals 
(99.8%). Imputation was used to fill in many of the 
missing calls in the sequenced individuals. The 0.2% of 
the genotype calls for these high call rate SNPs gener- 
ated by imputation yielded a surprisingly high discre- 
pancy rate of 25.2%. An analysis of these discrepancies 
yields some interesting results. This high discrepancy 
rate occurred even though the number of imputed indi- 
viduals for any given SNP was very low (at most 9 of 
the 463 individuals). Also, the majority of these discre- 
pancies (98.6%) consisted of a homozygote call from the 
GWAS chip and a heterozygote call from the imputa- 
tion. We believe that all imputed genotype calls 
included in the GENO file were nonambiguous. This is 
based on the data description (which states that likely 
imputation errors were left blank) combined with our 
observation that all genotypes with nonintegral dosage 
values (e.g., dosage = 0.001 or 0.999), as well as many 
additional genotypes with integral dosage values, were 
blank in the GENO file. 
Discrepancies in the nonsequenced individuals 
Not surprisingly, we see higher discrepancy rates in 
individuals whose sequences were inferred than in the 
sequenced individuals themselves. Having no sequenced 
family members clearly degraded the process further: 
the discrepancy rate for inferred individuals without 
genotyped family members was approximately twice the 
rate of discrepancies found in individuals with 
sequenced family members (Table 2, final two rows). 

Discrepancy rate for the full comparison single-nucleotide 
polymorphisms 

As would be expected, when we expand beyond the high 
call rate SNPs, the discrepancy rate increases. 



Table 2 Discrepancies by process type 
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Discrepancies between genome-wide association study (GWAS) data and GENO data sets, divided by analytical process. "Imputation" fills in missing genotypes in 
sequence data. "Inference" Infers phased sequence data on unsequenced Individuals based on GWAS data. SNP, single-nucleotide polymorphism. 
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Overall discrepancy rate 

The overall discrepancy rate combined across sequenced, 
imputed, and inferred genotype calls across the full 
comparison SNPs was also low (see Tables 1 and 2). In 
this group (958 individuals with 451,279 SNPs, for roughly 
432 million potential calls), calls were present in both the 
GWAS and the GENO data sets 97.6% of the time. Of 
these, slightly more than 1 million (1,099,402) were discre- 
pant (0.26%), for an overall concordance rate of 99.74%. 
We note that a small number of these SNPs (fewer than 
60) were outliers with an unusually high number of discre- 
pancies. (Two SNPs had more than 900 discrepancies, and 
8 SNPs had more than 800 discrepancies. These are likely 
to have been cases in which the sequenced SNP was not 
the SNP genotyped on the GWAS. None of these SNPs 
were in the high call rate SNPs. In the full sample, exclud- 
ing these SNPs results in a slight decrease in discrepancies 
in the second decimal place [results not shown].) 
Discrepancies in the sequenced individuals 
As was seen for the high call rate SNPs, an examination 
of discrepancy by source of the genotype call yields a 
much different picture. As indicated in Table 2, for the 
463 sequenced individuals, approximately 99.1% of the 
genotype calls in the GENO file were sequence reads. 
When we compare these sequence calls with the GWAS 
chip calls, we see a low discrepancy rate overall (0.06%). 
Although this is twice the rate found for the high call 
rate SNPs, it still strongly supports the supposition that 
both the sequencing calls and the genotype chip calls 
are highly reliable. However, attempting to impute the 
last 0.9% yielded a high discrepancy rate of 28.8%, 
slightly higher than the 25.2% rate found for the high 
call rate SNPs. An analysis of these discrepancies yields 
some interesting results. As was true for the smaller 
SNP set, the majority of these discrepancies (98.6%) 
consisted of a homozygote call from the GWAS chip 
and a heterozygote call from the imputation. In all of 
these cases as well, the evidence suggests that there was 
no ambiguity in any of these calls. 
Discrepancies in the nonsequenced individuals 
Not surprisingly, we see higher discrepancy rates in indivi- 
duals whose sequence was inferred than in the sequenced 
individuals themselves. Having no sequenced family mem- 
bers clearly degraded the process further: the discrepancy 
rate for inferred individuals without genotyped family 
members was approximately twice the rate of discrepan- 
cies found in individuals with sequenced family members 
(Table 2, final two rows). A breakdown of the discrepancy 
rate by family and sequenced versus inferred family 
members is provided in Table 1. 

Discussion and conclusions 

Our ability to critique the overall imputation and inference 
process was limited by the absence of imputation quality 



measures in the distributed data. For the reasons noted, 
we believe that only unambiguous imputation calls were 
included in the data. Almost all of the discrepancies 
between imputed and GWAS chip genotypes in sequenced 
individuals involved heterozygous imputation calls. Even 
though future researchers may be unlikely to follow this 
exact method, we believe our results highlight several 
generally applicable points: First, this is a cautionary 
paper. The highly skilled providers of the GAW18 data, in 
collaboration with one of the founders of the field of 
genetic imputation, provided data to the GAW partici- 
pants that were unreliable in some places. Clearly, evaluat- 
ing imputation quality is critical when using imputed data. 
However, the standard quality scores provided by the 
imputation programs, such as the allelic [6] do not take 
chance agreement into account, which is particularly pro- 
blematic for rare variants. Nonetheless, the IQS [7], which 
does takes chance agreement into account, is still not as 
widely used as might be warranted. Although a straightfor- 
ward application of the IQS requires true genotypes for 
comparison, multiple approximations have been suggested 
(e.g., [8]). 

Because of the complex imputation and inference 
method used by the GAW18 data providers, it is unclear 
at what point the discrepant heterozygotes were intro- 
duced into the process. It may be that these were intro- 
duced by Merlin when calculating the probabilities of 
each possible genotype for missing data in the complete 
pedigrees. If this is the case, then a heterozygote may be 
computationally very likely: a pair of heterozygous foun- 
ders is completely compatible with all possible offspring 
genotypes; deviations from Hardy- Weinberg (caused by 
the Wahlund effect, for example) may cause expected 
heterozygosity to be higher than observed; and in the 
case of deletions, the apparent genotyping errors that 
can occur from the transmission of the "null" allele can 
be resolved by assuming that the parent is, in fact, a 
heterozygote. It may be better to allow missing sequence 
data to remain missing and accept the inference of miss- 
ing genotypes on related individuals than to use highly 
likely genotypes inferred from the pedigree data. This 
may be especially important when studying rare variants 
because a modest number of inferred heterozygous 
founders could greatly influence transmission tests. 

Despite these cautions, we see clearly that the process 
of sequencing some individuals and inferring genotypes 
for related individuals produces high-quality data; this 
provides a very good first step in filling the gap until 
complete data are available. The inference process is 
less robust when using sequence data from unrelated 
individuals. 

Overall, the inference and imputation process worked 
very well. It is clear that GWAS chip data can be phased 
with high accuracy and sequence data can be inferred 
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with high concordance to the GWAS chip genotypes. 
There are three conclusions from our investigation: (a) 
there is a very high concordance rate between genotypes 
obtained from sequencing and those from a GWAS array, 
(b) When sequencing results in missing genotypes, it may 
be best to retain the missingness. If sequencing failed in a 
region because of abnormalities such as deletions, the 
imputation process may not have appropriate reference 
data to work with. This is especially true in the case of 
rare variants, with which incorrectly imputing a heterozy- 
gote may reduce the power or create false positives for 
transmission disequilibrium tests, (c) The imputation and 
inference process may result in final data that is discrepant 
from the original GWAS data. As a consequence, it may 
be prudent to incorporate the GWAS chip genotype calls 
into the final data set used for analysis (e.g., blanking 
discrepant genotypes). 
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